arXiv:1506.02693v2 [cs.IT] 31Jul2016 


Approximate Message Passing Algorithm 
with Universal Denoising 
and Gaussian Mixture Learning 

Yanting Ma, Student Member, IEEE, Junan Zhu, Student Member, IEEE, 
and Dror Baron, Senior Member, IEEE 


Abstract 

We study compressed sensing (CS) signal reconstruction problems where an input signal is measured via matrix 
multiplication under additive white Gaussian noise. Our signals are assumed to he stationary and ergodic, hut 
the input statistics are unknown; the goal is to provide reconstruction algorithms that are universal to the input 
statistics. We present a novel algorithmic framework that combines: (i) the approximate message passing (AMP) CS 
reconstruction framework, which solves the matrix channel recovery problem by iterative scalar channel denoising; (ii) 
a universal denoising scheme based on context quantization, which partitions the stationary ergodic signal denoising 
into independent and identically distributed (i.i.d.) subsequence denoising; and {Hi) a density estimation approach that 
approximates the probability distribution of an i.i.d. sequence by fitting a Gaussian mixture (GM) model. In addition to 
the algorithmic framework, we provide three contributions: (i) numerical results showing that state evolution holds for 
non-separable Bayesian sliding-window denoisers; (li) an i.i.d. denoiser based on a modified GM learning algorithm; 
and {Hi) a universal denoiser that does not need information about the range where the input takes values from or 
require the input signal to be bounded. We provide two implementations of our universal CS recovery algorithm 
with one being faster and the other being more accurate. The two implementations compare favorably with existing 
universal reconstruction algorithms in terms of both reconstruction quality and runtime. 
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1. Introduction 

A. Motivation 

Many scientific and engineering problems can be approximated as linear systems of the form 

y = Ax + z, (1) 

where x G is the unknown input signal, A € is the matrix that characterizes the linear system, 

and z G is measurement noise. The goal is to estimate x from the measurements y given A and statistical 
information about z. When M N, the setup is known as compressed sensing (CS); by posing a sparsity or 
compressibility requirement on the signal, it is indeed possible to accurately recover x from the ill-posed linear 
system [2,3]. However, we might need M > N when the signal is dense or the noise is substantial. 

One popular scheme to solve the CS recovery problem is LASSO [4] (also known as basis pursuit denoising [5]): 
X = argminxgRiv ^||y — AxUl + 7 |jx|ji, where || • ||p denotes the fp-norm, and 7 reflects a trade-off between 
the sparsity ||x||i and residual ||y — Ax|| 2 . This approach does not require statistical information about x and z, 
and can be conveniently solved via standard convex optimization tools or the approximate message passing (AMP) 
algorithm [ 6 ]. However, the reconstruction quality is often far from optimal in terms of mean square error (MSB). 

Bayesian CS recovery algorithms based on message passing [7-9] usually achieve better reconstruction quality, 
but must know the prior for x. For parametric signals with unknown parameters, one can infer the parameters and 
achieve the minimum mean square error (MMSE) in some settings; examples include EM-GM-AMP-MOS [10], 
turboGAMP [11], and adaptive-GAMP [12]. 

Unfortunately, possible uncertainty about the input statistics may make it difficult to select a model class for 
empirical Bayes algorithms; a mismatched model can yield excess mean square error (EMSE) above the MMSE, 
and the EMSE can get amplified in linear inverse problems (1) compared to that in scalar estimation problems [13]. 
Our goal is to develop universal schemes that approach the optimal Bayesian performance for stationary ergodic 
signals despite not knowing the input statistics. Although others have worked on CS algorithms for independent and 
identically distributed (i.i.d.) signals with unknown distributions [ 10 ], we are particularly interested in developing 
algorithms for signals that may not be well approximated by i.i.d. models, because real-world signals often contain 
dependencies between different entries. For example, we will see in Fig. 6 that a chirp sound clip is reconstructed 1-2 
dB better with models that can capture such dependencies than i.i.d. models applied to sparse transform coefficients. 

While approaches based on Kolmogorov complexity [14—17] are theoretically appealing for universal signal 
recovery, they are not computable in practice [18,19]. Several algorithms based on Markov chain Monte Carlo 
(MCMC) [20-23] leverage the fact that for stationary ergodic signals, both the per-symbol empirical entropy and 
Kolmogorov complexity converge asymptotically almost surely to the entropy rate of the signal [18], and aim to 
minimize the empirical entropy. The best existing implementation of the MCMC approach [23] often achieves an 
MSB that is within 3 dB of the MMSE, which resembles a result by Donoho for universal denoising [14]. 

In this paper, we confine our attention to the system model defined in (1), where the input signal x is stationary 
and ergodic. We merge concepts from AMP [ 6 ], Gaussian mixture (GM) learning [24] for density estimation, and 
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Fig. 1. Flow chart of AMP-UD. AMP (2, 3) decouples the linear inverse problem into scalar channel denoising problems. In the f-th iteration, 
the universal denoiser »)univ,t(') converts stationary ergodic signal denoising into i.i.d. subsequence denoising. Each i.i.d. denoiser »)iid,t(') (13) 
outputs the denoised subsequence and the derivative of the denoiser ^(•) (16). The algorithm stops when the iteration index t reaches 

the predefined maximum tMax, and outputs Ximm as the CS recovery result. 


universal denoising for stationary ergodic signals [25,26]. We call the resulting universal CS recovery algorithm 
AMP-UD (AMP with a universal denoiser). Two implementations of AMP-UD are provided, and they compare 
favorably with existing universal approaches in terms of reconstruction quality and runtime. 

B. Related work and main results 

Approximate message passing: AMP is an iterative algorithm that solves a linear inverse problem by successively 
converting matrix channel problems into scalar channel denoising problems with additive white Gaussian noise 
(AWGN). AMP has received considerable attention, because of its fast convergence and the state evolution (SE) 
formalism [6,27], which offers a precise characterization of the AWGN denoising problem in each iteration. AMP 
with separable denoisers has been rigorously proved to obey SE [27]. 

The focus of this paper is the reconstruction of signals that are not necessarily i.i.d., and so we need to explore 
non-separable denoisers. Donoho et al. [28] provide numerical results demonstrating that SE accurately predicts the 
phase transition of AMP when some well-behaved non-separable minimax denoisers are applied, and conjecture that 
SE holds for AMP with a broader class of denoisers. A compressive imaging algorithm that applies non-separable 
image denoisers within AMP appears in Tan et al. [29]. Rush et al. [30] apply AMP to sparse superposition 
decoding, and prove that SE holds for AMP with certain block-separable denoisers and that such an AMP-based 
decoder achieves channel capacity. A potential challenge of implementing AMP is to obtain the Onsager correction 
term [6], which involves the calculation of the derivative of a denoiser. Metzler et al. [31] leverage a Monte Carlo 
technique to approximate the derivative of a denoiser when an explicit analytical formulation of the denoiser is 
unavailable, and provide numerical results showing that SE holds for AMP with their approximation. 

Despite the encouraging results for using non-separable denoisers within AMP, a rigorous proof that SE holds 
for general non-separable denoisers has yet to appear. Consequently, new evidence showing that AMP obeys SE 
may increase the community’s confidence about using non-separable denoisers within AMP. Our first contribution 
is that we provide numerical results showing that SE holds for non-separable Bayesian sliding-window denoisers. 

Fitting Gaussian mixture models: Eigueiredo and Jain [24] propose an unsupervised GM learning algorithm that 
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fits a given data sequence with a GM model. The algorithm employs a cost function that resembles the minimum 
message length criterion, and the parameters are learned via expectation-maximization (EM). 

Our GM fitting problem involves estimating the probability density function (pdf) of a sequence x from its 
AWGN corrupted observations. We modify the GM fitting algorithm [24], so that a GM model can be learned from 
noisy data. Once the estimated pdf px of x is available, we estimate x by computing the conditional expectation 
with the estimated pdf px (recall that MMSE estimators rely on conditional expectation). Our second contribution 
is that we modify the GM learning algorithm, and extend it to an i.i.d. denoiser 

Universal denoising: Our denoiser for stationary ergodic signals is inspired by a context quantization ap¬ 
proach [26], where a universal denoiser for a stationary ergodic signal involves multiple i.i.d. denoisers for condi¬ 
tionally i.i.d. subsequences. Sivaramakrishnan and Weissman [26] have shown that their universal denoiser based 
on context quantization can achieve the MMSE asymptotically for stationary ergodic signals with known bounds. 

The boundedness condition of Sivaramakrishnan and Weissman [26] is partly due to their density estimation 
approach, in which the empirical distribution function is obtained by quantizing the bounded range of the signal. 
Such boundedness conditions may be undesirable in certain applications. We overcome this limitation by replacing 
their density estimation approach with GM model learning. Our third contribution is a universal denoiser that does 
not need information about the bounds or require the input signal to be bounded; we conjecture that our universal 
denoiser achieves the MMSE asymptotically under some technical conditions. 

A flow chart of AMP-UD, which employs the AMP framework, along with our modified universal denoiser 
(Puniv) and the GM-based i.i.d. denoiser (piid), is shown in Pig. 1. Based on the numerical evidence that SE holds 
for AMP with Bayesian sliding-window denoisers and the conjecture that our universal denoiser can achieve the 
MMSE, we further conjecture that AMP-UD achieves the MMSE under some technical conditions. The details of 
AMP-UD, including two practical implementations, are developed in Sections II-V. 

The remainder of the paper is arranged as follows. In Section II, we review AMP and provide new numerical 
evidence that AMP obeys SE with non-separable denoisers. Section III modifies the GM fitting algorithm, and 
extends it to an i.i.d. denoiser. In Section IV, we extend the universal denoiser based on context quantization to 
overcome the boundedness condition, and two implementations are provided to improve denoising quality. Our 
proposed AMP-UD algorithm is summarized in Section V. Numerical results are shown in Section VI, and we 
conclude the paper in Section VII. 

II. Approximate message passing with sliding-window denoisers 

In this section, we apply non-separable Bayesian sliding-window denoisers within AMP, and provide numerical 
evidence that state evolution (SE) holds for AMP with this class of denoisers. 

A. Review of AMP 

Consider a linear system (1), where the measurement matrix A has zero mean i.i.d. Gaussian entries with unit- 
norm columns on average, and z represents i.i.d. Gaussian noise with pdf pzizi) = where Zi is the 
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i-th entry of the vector z, and Af{x; /r, cr^) denotes a Gaussian pdf: 

Note that AMP has been proved to follow SE when A is a zero mean i.i.d. Gaussian matrix, but may diverge 
otherwise. Several techniques have been proposed to improve the convergence of AMP [32-35]. Moreover, other 
noise distributions can be supported using generalized AMP (GAMP) [9], and the noise distribution can be estimated 
in each GAMP iteration [12]. Such generalizations are beyond the scope of this work. 

Starting with Xq = 0, the AMP algorithm [6] proceeds iteratively according to 


xt+i = r]t{A^rt+Xt), 


r* = y - Axt + 


1 

R 






( 2 ) 

(3) 


where R = M/N represents the measurement rate, t represents the iteration index, r]t{-) is a denoising function, 
and (u) = Ui for some vector u € The last term in (3) is called the Onsager correction term 

in statistical physics. The empirical distribution of x is assumed to converge to some probability distribution 
px on M, and the denoising function r]t{-) is separable in the original derivation of AMP [6,27,36]. That is, 
r]t{u) = {r]t{ui),T]t{u 2 ),...,r]t{uN)) and ? 7 [(u) = {'q[(ui),r][{u 2 ), where denotes the derivative 
of A useful property of AMP is that at each iteration, the vector + x* € in (2) is statistically 

equivalent to the input signal x corrupted by AWGN, where the noise variance evolves following SE in the 
limit of large systems {N —)■ oo, M/N —> R): 


<7(^1 = cr^ + 


(4) 


where MSE{r}t^ = Ex,w + (JtW) — , W ^ Af{w; 0,1), X ~ px, and erg = Formal 

statements for SE appear in the reference papers [27,36]. Additionally, it is convenient to use the following estimator 
for af [27,36]: 




(5) 


B. State evolution for Bayesian sliding-window denoisers 

SE allows to calculate the asymptotic MSE of linear systems from the MSE of the denoiser used within AMP. 
Therefore, knowing that SE holds for AMP with the denoisers that we are interested in can help us choose a good 
denoiser for AMP. It has been conjectured by Donoho et al. [28] that AMP with a wide range of non-separable 
denoisers obeys SE. We now provide new evidence to support this conjecture by constructing non-separable Bayesian 
denoisers within a sliding-window denoising scheme for two stationary ergodic Markov signal models, and showing 
that SE accurately predicts the performance of AMP with this class of denoisers for large signal dimension N. 
Note that for a signal that is generated by a stationary ergodic process, its empirical distribution converges to the 
stationary distribution, hence the condition on the input signal in the proof for SE [27] is satisfied, and our goal is 
to numerically verify that SE holds for AMP with non-separable sliding-window denoisers for stationary ergodic 
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signal models. Our rationale for examining the SE performance of sliding-window denoisers is that the context 
quantization based universal denoiser [26], which will be used in Section IV, resembles a sliding-window denoiser. 
The mathematical model for an AWGN channel denoising problem is defined as 


q = x-f V, 


(6) 


where x G is the input signal, v G is AWGN with pdf pv{vi) = 0, cr^), and q G is a sequence 

of noisy observations. Note that we are interested in designing denoisers for AMP, and the noise variance of the 
scalar channel in each AMP iteration can be estimated as (5). Therefore, throughout the paper we assume that 
the noise variance is known when we discuss scalar channels. 

In a separable denoiser, Xj is estimated only from its noisy observation qj. The separable Bayesian denoiser that 
minimizes the MSE is point-wise conditional expectation, 

Xj = E[X|Q = Qj] = J xp{x\qj)dx, (7) 

where Bayes’ rule yields p{x\qj) = . If entries of the input signal x are drawn independently from 

px, then (7) achieves the MMSE. 

When there are statistical dependencies among the entries of x, a sliding-window scheme can be applied to 
improve the MSE. We consider two Markov sources as examples that contain statistical dependencies, and emphasize 
that our true motivation is the richer class of stationary ergodic sources. 

Example source 1: Consider a two-state Markov state machine that contains states sq (zero state in which 
the signal entries are zero) and Si (nonzero state in which entries are nonzero). The transition probabilities are 
Pio = p('So|si) and poi = 7 '('Si|'So)- In the steady state, the marginal probability of state si is • We call our 

first example source Markov-Gaussian (MGauss for short); it is generated by the two-state Markov machine with 
Poi = gfo ™tl pio = and in the nonzero state the signal value follows a Gaussian distribution Af{x; pxtO'x)- 
These state transition parameters yield 3% nonzero entries in an MGauss signal on average. 

Example source 2: Our second example is a four-state Markov switching signal (M4 for short) that follows 
the pattern -1-1, -fl, —1, —1, -fl, -fl, —1, —1... with 3% error probability in state transitions, resulting in the signal 
switching from —1 to -fl or vice versa either too early or too late; the four states si = [—1 — 1 ], S 2 = [—1 + 1 ], 
S 3 = [-1-1 — 1], and S 4 = [-1-1 -I- 1] have equal marginal probabilities 0.25 in the steady state. 

Bayesian sliding-window denoiser: Let 0 be a binary vector, where Oi = 0 indicates Xi = 0, and 9i = 1 
indicates Xi 7 ^ 0. Denoting a block {us,Us+i, of any sequence u by u* for s < t, the {2k -I- 1)-Bayesian 

sliding-window denoiser r/MCauss for the MGauss signal is defined as 






E 


j+k 


n h{q„ep,p,x,<jl,(jl)p@j+>‘{dj1il) 


Sj=si 




cr2 -f 0-2 

^ X ' ^ V 


{qj - Ma:) + , (8) 
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where 




\Af{qi;fix,crl + al), if 6»i = si 

1 A/'(gi; 0 , 0 - 2 ), jf 


j+k-l 

=p(6'j-fc) ]J pi9^+i\0,), 

i=j-k 

3+k 

n h{q„9i-px,(Tl), 


i-j-k 


and the summation is over G {sojSi}^^^^ 
The MSB of ?7MGaussj is 

MSE(7?MGauss,Cr2) = E 


^Xj 77MGauss,j (Qj_j,)^ 


Poiial+nl) 

Poi +PlO 


/R2fc + 1 


^MGauss,j (*d)PQj+j (q)rfq- 


Similarly, the {2k + 1)-Bayesian sliding-window denoiser r]M 4 for the M4 signal is defined as 


(9) 


m4,M^tl) = nx,\Q^tl = ci>+l] 

q'-t) +Px,,Q);^,(-i> qj-fc) ’ 


( 10 ) 


where 


j+k-2 

P^j + k{K^.'{il) = p{Xj-k,Xj-k+l) p{Xi+2\Xi+i,Xi), 

i=j-k 


3+k 

i=j-k 


where the summation is over G {“!> 1}2^+^ with Xj = x G {—1,1} fixed. 
It can be shown that 


MSE(?7m4, crl) = E 


= 4 


■\3+k 


PXj,Q^. + l 


/R2fc + 1 


pq^+’‘ (q) 


dq. 


( 11 ) 


If AMP with ?7MGauss or ? 7 m 4 obeys SE, then the noise variance af should evolve according to (4). As a consequence, 
the reconstruction error at iteration t can be predicted by evaluating (9) or (11) with being replaced by a^. 

Numerical evidence: We apply ryMCauss (8) within AMP for MGauss signals, and pm 4 (10) within AMP for M4 
signals. The window size 2fc +1 is chosen to be 1 or 3 for pMGauss^ and 1 or 5 for pm 4 - Note that when the window 
size is 1, r/MGauss and ? 7 m 4 become separable denoisers. The MSE predicted by SE is compared to the empirical 
MSB at each iteration where the input signal to noise ratio (SNR = 101og]^Q[(A^E[Ar2])/(MCT2)]) is 10 dB for both 
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Iteration 


Fig. 2. Top: Numerical verification of SE for AMP with 77MGauss (8) when the input is an MGauss signal. {N = 20000, R = M/N = 0.4, SNR = 
10 dB.) Bottom: Numerical verification of SE for AMP with 77^4 (10) when the input is an M4 signal. (N = 20000, R = 0.4, SNR = 10 dB.) 


MGauss and M4. It is shown in Fig. 2 for AMP with TyMCauss and ? 7 m 4 that the markers representing the empirical 
MSE track the lines predicted by SE, and that side-information from neighboring entries helps improve the MSE. 

Our SE results for the two Markov sources increase our conhdence that AMP with non-separable denoisers that 
incorporate information from neighboring entries will track SE. 

The reader may have noticed from Eig. 1 that the universal denoiser r/univd is acting as a set of separable denoisers 
However, the statistical information used by 77iid(-) is learned from subsequences of the noisy 

sequence qt, and the subsequencing result is determined by the neighborhood of each entry. The SE results for the 
Bayesian sliding-window denoisers motivate us to apply the universal denoiser within AMP for CS reconstruction 
of stationary ergodic signals with unknown input statistics. Indeed, the numerical results in Section VI show that 
AMP with a universal denoiser leads to a promising universal CS recovery algorithm. 

III. I.I.D. DENOISING VIA GAUSSIAN MIXTURE EITTING 

We will see in Section IV that context quantization maps the non-i.i.d. sequence q into conditionally independent 
subsequences, and now focus our attention on denoising the resulting i.i.d. subsequences. 

A. Background 

The pdf of a Gaussian mixture (GM) has the form: 

s 

p{x) (12) 

where S is the number of Gaussian components, and X]s=i = 1, so that p{x) is a proper pdf. 

Eigueiredo and Jain [24] propose to fit a GM model for a given data sequence by starting with some arbitrarily 
large S, and inferring the structure of the mixture by letting the mixing probabilities as of some components 
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be zero. This leads to an unsupervised learning algorithm that automatically determines the number of Gaussian 
components from data. This approach resembles the concept underlying the minimum message length (MML) 
criterion that selects the best overall model from the entire model space, which differs from model class selection 
based on the best model within each class.' This criterion can be interpreted as posing a Dirichlet prior on the 
mixing probability and perform maximum a poteriori estimation [24]. A component-wise EM algorithm that updates 
{tts, Msi sequentially in s is used to implement the MML-based approach. The main feature of the component¬ 
wise EM algorithm is that if a* is estimated as 0, then the s-th component is immediately removed, and the 
expectation is recalculated before moving to the estimation of the next component. 

B. Extension to denoising 

Consider the scalar channel denoising problem defined in (6) with an i.i.d. input. We propose to estimate x from 
its Gaussian noise corrupted observations q by posing a GM prior on x, and learning the parameters of the GM 
model with a modified version of the algorithm by Eigueiredo and Jain [24]. 

Initialization of EM: The EM algorithm must be initialized for each parameter, cr^}, s = 1,..., S. One 

may choose to initialize the Gaussian components with equal mixing probabilities and equal variances, and the initial 
value of the means are randomly sampled from the input data sequence [24], which in our case is the sequence of 
noisy observations q. However, in CS recovery problems, the input signal is often sparse, and it becomes difficult 
to correct the initial value if the initialized values are far from the truth. To see why a poor initialization might be 
problematic, consider the following scenario: a sparse binary signal that contains a few ones and is corrupted by 
Gaussian noise is sent to the algorithm. If the initialization levels of the /i^’s are all around zero, then the algorithm 
is likely to fit a Gaussian component with near-zero mean and large variance rather than two narrow Gaussian 
components, one of which has mean close to zero while the other has mean close to one. 

To address this issue, we modify the initialization to examine the maximal distance between each symbol of 
the input data sequence and the current initialization of the /i^’s. If the distance is greater than O.laq, then we 
add a Gaussian component whose mean is initialized as the value of the symbol being examined, where cr^ is 
the estimated variance of the noisy observations q. We found in our simulations that the modified initialization 
improves the accuracy of the density estimation, and speeds up the convergence of the EM algorithm; the details 
of the simulation are omitted for brevity. 

Parameter estimation from noisy data: Two possible modifications can be made to the original GM learning 
algorithm [24] that is designed for clean data. 

We first notice that the model for the noisy data is a GM convolved with Gaussian noise, which is a new GM 
with larger component variances. Hence, one approach is to use the original algorithm [24] to fit a GM to the noisy 
data, but to remove a component immediately during the EM iterations if the estimated component variance is much 

’All models with the same number of components belong to one model class, and different models within a model class have different 
parameters for each component. 
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smaller than the noise variance Specifically, during the parameter learning process, if a component has variance 
that is less than 0.2cr^, we assume that this low-variance component is spurious, and remove it from the mixture 
model. However, if the component variance is between 0.2cr^ and 0.9a^, then we force the component variance to 
be 0.9(7^ and let the algorithm keep tracking this component. For component variance greater than 0.9a^, we do not 
adjust the algorithm. The parameters 0.2 and 0.9 are chosen, because they provide reasonable MSB performance for 
a wide range of signals that we tested. These parameters are then fixed for our algorithm to generate the numerical 
results in Section VI. At the end of the parameter learning process, all remaining components with variances less 
than are set to have variances equal to a^. That said, when subtracting the noise variance from the Gaussian 
components of pg to obtain the components of px, we could have components with zero-valued variance, which 
yields deltas in px- Note that deltas are in general difficult to fit with a limited amount of observations, and our 
modification helps the algorithm estimate deltas. 

Another approach is to introduce latent variables that represent the underlying clean data, and estimate the 
parameters of the GM for the latent variables directly. Hence, similar to the original algorithm, a component is 
removed only when the estimated mixing probability is non-positive. It can be shown that the GM parameters are 
estimated as 


max 


Cts(f + 1) — 


^ 2=1 ^ 


Y, max| X; - l,o| 

:as>0 ^ i—1 ^ 


p-s(f + 1 ) — 


2=1 


N 

£ 




dl{t + l) = 


X -Ps{t + 1)) 


where 


E 

2 = 1 

Xs), . _ as(t)N(qi]'ils{t),ul+ds(tf) 

Vl g 1 


\t) = 






{t) = 


0=2(f) -Per- 
2 2 


iQi 


m' 


Detailed derivations appear in the Appendix. 

We found in our simulation that the first approach converges faster and leads to lower reconstruction error, 
especially for discrete-valued inputs. Therefore, the simulation results presented in Section VI use the first approach. 
Denoising: Once the parameters in (12) are estimated, we define a denoiser for i.i.d. signals as conditional 
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expectation: 


Vudiq) = E[X |(5 = q] 
s 

= ^E[X|(5 = g,comp = s]F(comp = s\Q = q) 

S^l 

S 

= E 


r(<3'-Ais) 


' S ' ^ V 


asN{q-,Hs:(Tl + al) 
ELi asM{q] Us, (tI + ct2) ’ 


where comp is the component index, and 


E[X |(3 = q, comp = s] = 


r(9-Ms) + Mj 


(13) 


is the Wiener filter for component s. 

We have verified numerically for several distributions and low to moderate noise levels that the denoising results 
obtained by the GM-based i.i.d. denoiser (13) approach the MMSE within a few hundredths of a dB. For example, 
the favorable reconstruction results for i.i.d. sparse Laplace signals in Fig. 3 show that the GM-based denoiser 
approaches the MMSE. 


IV. Universal denoising 

We have seen in Section III that an i.i.d. denoiser based on GM learning can denoise i.i.d. signals with 
unknown distributions. Our goal in this work is to reconstruct stationary ergodic signals that are not necessarily 
i.i.d. Sivaramakrishnan and Weissman [26] have proposed a universal denoising scheme for stationary ergodic 
signals with known bounds based on context quantization, where a stationary ergodic signal is partitioned into i.i.d. 
subsequences. In this section, we modify the context quantization scheme and apply the GM-based denoiser (13) 
to the i.i.d. subsequences, so that our universal denoiser can denoise stationary ergodic signals that are unbounded 
or with unknown bounds. 

A. Background 

Consider the denoising problem ( 6 ), where the input x is stationary ergodic. The main idea of the context 
quantization scheme [26] is to quantize the noisy symbols q to generate quantized contexts that are used to partition 
the unquantized symbols into subsequences. That is, given the noisy observations q € define the context of qj 
as Cj = q^+i] s for j = 1 -f fc,..., N—k, where [a; b] denotes the concatenation of the sequences a and 

b. For j<k or j>N — k + 1, the median value gmed of q is used as the missing symbols in the contexts. As an 
example for j = k, we only have k—1 symbols in q before qk, and so the first symbol in is missing; we define 
Cfe = [(/meci; qi~^; q^El" ^®otor quantization can then be applied to the context set C = {cj : j = 1,..., A^}, and 
each Cj is assigned a label Ij € {1,..., L} that represents the cluster that Cj belongs to. Finally, the L subsequences 
that consist of symbols from q with the same label are obtained by taking q^^^ = {qj : Ij = 1}, for I = 1, 

The symbols in each subsequence q^^^ are regarded as approximately conditionally identically distributed given 
the common quantized contexts. The rationale underlying this concept is that a sliding-window denoiser uses 
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information from the contexts to estimate the current symbol, and symbols with similar contexts in the noisy output 
of the scalar channel have similar contexts in the original signal. Therefore, symbols with similar contexts can be 
grouped together and denoised using the same denoiser. Note that Sivaramakrishnan and Weissman [26] propose a 
second subsequencing step, which further partitions each subsequence into smaller subsequences such that a symbol 
in a subsequence does not belong to the contexts of any other symbols in this subsequence. This step ensures that 
the symbols within each subsequence are mutually independent, which is crucial for theoretical analysis. However, 
for finite-length signals, small subsequences may occur, and they may not contain enough symbols to learn its 
empirical pdf well. Therefore, we omit this second subsequencing step in our implementations. 

In order to estimate the distribution of which is the clean subsequence corresponding to Sivaramakrish¬ 
nan and Weissman [26] first estimate the pdf of via kernel density estimation. They then quantize the range 
that Xi’s take values from and the levels of the empirical distribution function of x, and find a quantized distribution 
function that matches well. Once the distribution function of x^*^ is obtained, the conditional expectation of 
the symbols in the /-th subsequence can be calculated. 

For error metrics that satisfy some mild technical conditions, Sivaramakrishnan and Weissman [26] have proved 
for stationary ergodic signals with bounded components that their universal denoiser asymptotically achieves the 
optimal estimation error among all sliding-window denoising schemes despite not knowing the prior for the signal. 
When the error metric is square error, the optimal error is the MMSE. 

B. Extension to unbounded signals and signals with unknown bounds 

Sivaramakrishnan and Weissman [26] have shown that one can denoise a stationary ergodic signal by (/) grouping 
together symbols with similar contexts and (//) applying an i.i.d. denoiser to each group. Such a scheme is optimal 
in the limit of large signal dimension N. However, their denoiser assumes an input with known bounds, which 
might make it inapplicable to some real-world settings. In order to be able to estimate signals that take values 
from the entire real line, in step (//), we apply the GM learning algorithm for density estimation, which has been 
discussed in detail in Section III, and compute the conditional expectation with the estimated density as our i.i.d. 
denoiser. 

We now provide details about a modification made to step (/). The context set C is acquired in the same way 
as described in Section IV-A. Because the symbols in the context Cj € C that are closer in index to qj are likely 
to provide more information about Xj than the ones that are located further away, we add weights to the contexts 
before clustering. That is, for each Cj € C of length 2k, the weighted context is defined as 

c' = Cj © w, 

where © denotes a point-wise product, and the weights take values, 

g-/3(fc-fe,)^ fci = l,..,fc 
-/3(fc,-fc-l)^ k^ = k + l,...,2k 


Wki = 


e 


(14) 
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for some /3 > 0. While in noisier channels, it might be necessary to use information from longer contexts, 
comparatively short contexts could be sufficient for cleaner channels. Therefore, the exponential decay rate /3 
is made adaptive to the noise level in a way such that /3 increases with SNR. Specifically, /3 is chosen to be linear 
in SNR: 

/3 = 6i logio((llq||2/^ - + ^2, (15) 


where &i > 0 and 62 can be determined numerically. Specifically, we run the algorithm with a sufficiently large 
range of /? values for various input signals at various SNR levels and measurement rates. For each setting, we 
select the /3 that achieves the best reconstruction quality. Then, bi and 62 obtained using the least squares 
approach. Note that bi and 62 are fixed for all the simulation results presented in Section VI. If the parameters were 
tuned for each individual input signal, then the optimal parameter values might vary for different input signals, 
and the reconstruction quality might be improved. The simulation results in Section VI with fixed parameters 
show that when the parameters are slightly off from the individually optimal ones, the reconstruction quality of 
AMP-UD is still comparable or better than the prior art. We choose the linear relation because it is simple and 
fits well with our empirical optimal values for /?; other choices for P might be possible. The weighted context set 
C' = {c' : j = 1,...,A^} is then sent to a /c-means algorithm [37], and are obtained according 

to the labels determined via clustering. We can now apply the GM-based i.i.d. denoiser (13) to each subsequence 
separately. However, one potential problem is that the GM fitting algorithm might not provide a good estimate of 
the model when the number of data points is small. We propose two approaches to address this small cluster issue. 

Approach 1: Borrow members from nearby clusters. A post-processing step can be added to ensure that the 
pdf of is estimated from no less than T symbols. That is, if the size of which is denoted by B, is less 
than T, then T — B symbols in other clusters whose contexts are closest to the centroid of the current cluster are 
included to estimate the empirical pdf of q^*), while after the pdf is estimated, the extra symbols are removed, and 
only q(^) is denoised with the currently estimated pdf. We call UD with Approach 1 “UDl.”^ 

Approach 2: Merge statistically similar subsequences. An alternative approach is to merge subsequences 
iteratively according to their statistical characterizations. The idea is to find subsequences with pdfs that are close 
in Kullback-Leibler (KL) distance [18], and decide whether merging them can yield a better model according to 
the minimum description length (MDL) [38] criterion. Denote the iteration index for the merging process by h. 
After the fc-means algorithm, we have obtained a set of subsequences : I = 1 ,..., Lh}, where Lh is the current 
number of subsequences. A GM pdf is learned for each subsequence qj^^ The MDL cost for the current 
model is calculated as: 


MDL 

Ch 


Lh Lh Q (/) Lh (i) / T \ 

Y. Y >«s + 1 : ^ log (ill” 1)+2 ■ + io ■ i: ^ log . 


related approach is k-nearest neighbors, where for each symbol in q, we find T symbols whose contexts are nearest to that of the current 
symbol and estimate its pdf from the T symbols. The k-nearest neighbors approach requires to run the GM learning algorithm [24] N times in 
each AMP iteration, which significantly slows down the algorithm. 
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where is the i-th entry of the subsequence is the number of Gaussian components in the mixture 

model for subsequence Lg is the number of subsequences before the merging procedure, and is the number 
of subsequences in the initial set {qg ^ : / = 1,Lg} that are merged to form the subsequence The four terms 
in are interpreted as follows. The first term is the negative log likelihood of the entire noisy sequence q/j given 
the current GM models. The second term is the penalty for the number of parameters used to describe the model, 
where we have 3 parameters (a,/!, cr^) for each Gaussian component, and components for the subsequence 
qjj^ The third term arises from 2 bits that are used to encode rnf^ for I = 1,..., Lh, because our numerical results 

^ „(0 f T \ 

have shown that the number of Gaussian components rarely exceeds 4. In the fourth term, 2_//=i j 

the uncertainty that a subsequence from the initial set is mapped to qj^^ with probability /Lq, for I = 1, ...^Lh- 
Therefore, the fourth term is the coding length for mapping the Lq subsequences from the initial set to the current 
set. 

We then compute the KL distance between the pdf of q^^®^ and that of q^*^, for = 1, 



A symmetric Lh x Lh distance matrix D/i is obtained by letting its s-th row and f-th column be 

Suppose the smallest entry in the upper triangular part of D/j (not including the diagonal) is located in the s*-th 
row and f*-th column, then ^ and q|^* ^ are temporarily merged to form a new subsequence, and a new GM pdf 
is learned for the merged subsequence. We now have a new model with Lh+i = Lh — 1 GM pdfs, and the MDL 
criterion calculated for the new model. If is smaller than then we accept the new model, and 

calculate a new Lh+i x Lh+i distance matrix D/j+i; otherwise we keep the current model, and look for the next 
smallest entry in the upper triangular part of the current Lh x Lh distance matrix. The number of subsequences is 
decreased by at most one after each iteration, and the merging process ends when there is only one subsequence left, 
or the smallest KL distance between two GM pdfs is greater than some threshold, which is determined numerically. 
We call UD with Approach 2 “UD2.” 

We will see in Section VI that UD2 is more reliable than UDl in terms of MSE performance, whereas UDl 
is faster than UD2. This is because UD2 applies a more complicated (and thus slower) subsequencing procedure, 
which allows more accurate GM models to be fitted to subsequences. 

V. Proposed Universal CS recovery algorithm 

Combining the three components that have been discussed in Sections II-IV, we are now ready to introduce our 
proposed universal CS recovery algorithm AMP-UD. Note that the AMP-UD algorithm is designed for medium to 
large size problems. Specifically, the problem size should be large enough, such that the decoupling effect of AMP, 
which converts the compressed sensing problem to a series of scalar channel denoising problems, approximately 
holds, and that the statistical information about the input can be approximately estimated by the universal denoiser. 
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Consider a linear system (1), where the input signal x is stationary and ergodic with unknown distributions, and 
the matrix A has i.i.d. Gaussian entries. To estimate x from y given A, we apply AMP as defined in (2) and (3). 
In each iteration, AWGN corrupted observations, qt = xj + A^r^ = x + v, are obtained, where is estimated 
by (T^ (5). A subsequencing approach is applied to generate i.i.d. subsequences, where Approach 1 and Approach 
2 (Section IV-B) are two possible implementations. The GM-based i.i.d. denoiser (13) is then utilized to denoise 
each i.i.d. subsequence. 

To obtain the Onsager correction term in (3), we need to calculate the derivative of piid (13). For g € K, denoting 

.^2 


/(<?) = 0's + 


S 


;iq - frs) 




rl), 


we have that 


S^l 

5 




S =1 


<^s+K- 9Fs 


q- fj-s 


(^s{q - t^s)' 
O'.? + cr? 




O': 


Therefore, 


Viidiq) = 


nq)9{q) - f{q)9'{q) 
{9{qW 


(16) 


We highlight that AMP-UD is unaware of the input SNR and also unaware of the input statistics. The noise 
variance cr? in the scalar channel denoising problem is estimated by the average energy of the residual (5). The 
input’s statistical structure is learned by the universal denoiser without any prior assumptions. 

It has been proved [26] that the context quantization universal denoising scheme can asymptotically achieve the 
MMSE for stationary ergodic signals with known bounds. We have extended the scheme to unbounded signals in 
Sections III and W, and conjecture that our modified universal denoiser can asymptotically achieve the MMSE 
for unbounded stationary ergodic signals. AMP with MMSE-achieving separable denoisers has been proved to 
asymptotically achieve the MMSE in linear systems for i.i.d. inputs [27]. In Section II-B, we have provided numerical 
evidence that shows that SE holds for AMP with Bayesian sliding-window denoisers. Bayesian sliding-window 
denoisers with proper window-sizes are MMSE-achieving non-separable denoisers [26]. Given that our universal 
denoiser resembles a Bayesian sliding-window denoiser, we conjecture that AMP-UD can achieve the MMSE for 
stationary ergodic inputs in the limit of large linear systems where the matrix has i.i.d. random entries. Note that 
we have optimized the window-size for inputs of length N = 10000 via numerical experiments. We believe that the 
window size should increase with N, and leave the characterization of the optimal window size for future work. 


"VI. Numerical results 

We run AMP-UD 1 (AMP with UDl) and AMP-UD2 (AMP with UD2) in MATLAB on a Dell OPTIPLEX 9010 
running an Intel(R) Core™ i7-3770 with 16GB RAM, and test them utilizing different types of signals, including 
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10dB AMP-UD2 
10dB AMP-UD1 
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10dBEMGM-AMP-MOS 
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5dB AMP-UD2 
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Measurement rate (R) 


0.6 
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Fig. 3. Two AMP-UD implementations, SLA-MCMC, and EM-GM-AMP-MOS reconstruction results for an i.i.d. sparse Laplace signal as a 
function of measurement rate {R = M/N). Note that the SDR curves for the two AMP-UD implementations and EM-GM-AMP-MOS overlap 
the MMSE curve. {N = 10000, SNR = 5 dB or 10 dB.) 


synthetic signals, a chirp sound clip, and a speech signal, at various measurement rates and SNR levels, where we 
remind the reader that SNR is dehned in Section II-B. The input signal length N is 10000 for synthetic signals 
and roughly 10000 for the chirp sound clip and the speech signal. The context size 2k is chosen to be 12, and 
the contexts are weighted according to (14) and (15). The context quantization is implemented via the A:-means 
algorithm [37]. In order to avoid possible divergence of AMP-UD, possibly due to a bad GM htting, we employ a 
damping technique [32] to slow down the evolution. Specihcally, damping is an extra step in the AMP iteration (3); 
instead of updating the value of Xj+i by the output of the denoiser 77t(A^rt+Xt), a weighted sum of r]t{A'^rt+xt) 
and Xt is taken as follows, 

xt+i = Xr]t{A^rt + xj) + (1 - A)xt, 

for some A G (0,1]. 

Parameters for AMP-UDl: The number of clusters L is initialized as 10, and may become smaller if empty 
clusters occur. The lower bound T on the number of symbols required to learn the GM parameters is 256. The 
damping parameter A is 0.1, and we run 100 AMP iterations. 

Parameters for AMP-UDl: The initial number of clusters is set to be 30, and these clusters will be merged 
according to the scheme described in Section IV. Because each time when merging occurs, we need to apply the 
GM htting algorithm one more time to learn a new mixture model for the merged cluster, which is computationally 
demanding, we apply adaptive damping [34] to reduce the number of iterations required; the number of AMP 
iterations is set to be 30. The damping parameter is initialized to be 0.5, and will increase (decrease) within the 
range [0.01,0.5] if the value of the scalar channel noise estimator (5) decreases (increases). 

The recovery performance is evaluated by signal to distortion ratio (SDR = 10 log]^Q(E[A^]/MSE)), where the 
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Fig. 4. Two AMP-UD implementations, SLA-MCMC, and turboGAMP reconstruction results for a two-state Markov signal with nonzero entries 
drawn from a uniform distribution U[0, 1] as a function of measurement rate. Note that the SDR curves for the two AMP-UD implementations 
overlap at SNR = 5 dB, and they both overlap turboGAMP at SNR = 10 dB. {N = 10000, SNR = 5 dB or 10 dB.) 


MSE is averaged over 50 random draws of x. A, and z. 

We compare the performance of the two AMP-UD implementations to (i) the universal CS recovery algorithm 
SLA-MCMC [23]; and (ii) the empirical Bayesian message passing approaches EM-GM-AMP-MOS [10] for i.i.d. 
inputs and turboGAMP [11] for non-i.i.d. inputs. Note that EM-GM-AMP-MOS assumes during recovery that the 
input is i.i.d., whereas turboGAMP is designed for non-i.i.d. inputs with a known statistical model. We do not 
include results for other well-known CS algorithms such as compressive sensing matching pursuit (CoSaMP) [39], 
gradient projection for sparse reconstruction (GPSR) [40], or ii minimization [2,3], because their SDR performance 
is consistently weaker than the three algorithms being compared. 

Sparse Laplace signal (i.i.d.): We tested i.i.d. sparse Laplace signals that follow the distribution px(x) = 
0.03£(0,1) -f 0.97S{x), where £(0,1) denotes a Laplacian distribution with mean zero and variance one, and 
(5( ) is the delta function [41]. It is shown in Fig. 3 that the two AMP-UD implementations and EM-GM-AMP- 
MOS achieve the MMSE [42,43], whereas SLA-MCMC has weaker performance, because the MCMC approach 
is expected to sample from the posterior and its MSE is twice the MMSE [14,23]. 

Markov-uniform signal: Consider the two-state Markov state machine dehned in Section II-B with poi = gfo 
and pio = A Markov-uniform signal (MUnif for short) follows a uniform distribution ?7[0,1] at the nonzero 
state si. These parameters lead to 3% nonzero entries in an MUnif signal on average. It is shown in Fig. 4 that 
at low SNR, the two AMP-UD implementations achieve higher SDR than SLA-MCMC and turboGAMP. At high 
SNR, the two AMP-UD implementations and turboGAMP have similar SDR performance, and are slightly better 
than SLA-MCMC. We highlight that turboGAMP needs side information about the Markovian structure of the 
signal, whereas the two AMP-UD implementations and SLA-MCMC do not. 









18 



Fig. 5. Two AMP-UD implementations, SLA-MCMC, and turboGAMP reconstruction results for a dense two-state Markov signal with nonzero 
entries drawn from a Rademacher (dil) distribution as a function of measurement rate. {N = 10000, SNR = 10 dB or 15 dB.) 


Dense Markov-Rademacher signal: Consider the two-state Markov state machine dehned in Section II-B with 
Poi = ^ and Pro = ^ dense Markov Rademacher signal (MRad for short) takes values from {—with 
equal probability at si. These parameters lead to 30% nonzero entries in an MRad signal on average. Because 
the MRad signal is dense (non-sparse), we must measure it with somewhat larger measurement rates and SNRs 
than before. It is shown in Fig. 5 that the two AMP-UD implementations and SLA-MCMC have better overall 
performance than turboGAMP. AMP-UD 1 outperforms SLA-MCMC except for the lowest tested measurement rate 
at low SNR, whereas AMP-UD2 outperforms SLA-MCMC consistently. 

Chirp sound clip and speech signal: Our experiments up to this point use synthetic signals. We now evaluate the 
reconstruction quality of AMP-UD for two real-world signals. A “Chirp” sound clip and a speech signal are used. We 
cut a segment with length 9600 out of the “Chirp” and a segment with length 10560 out of the speech signal (denoted 
by x), and performed a short-time discrete cosine transform (DCT) with window size, number of DCT points, and 
hop size all being 32. The resulting short-time DCT coefficients matrix are then vectorized to form a coefficient 
vector 9. Denoting the short-time DCT matrix by W“^, we have 9 = W“^x. Therefore, we can rewrite (1) as 
y = + z, where $ = AW. Our goal is to reconstruct 9 from the measurements y and the matrix $. After we 

obtain the estimated coefficient vector 9, the estimated signal is calculated as x = W0. Although the coefficient 
vector 9 may exhibit some type of memory, it is not readily modeled in closed form, and so we cannot provide a valid 
model for turboGAMP [11]. Therefore, we use EM-GM-AMP-MOS [10] instead of turboGAMP [11]. The SDRs for 
the two AMP-UD implementations, SLA-MCMC and EM-GM-AMP-MOS [10] for the “Chirp” are plotted in Pig. 6 
and the speech signal in Fig. 7. We can see that both AMP-UD implementations outperform EM-GM-AMP-MOS 
consistently, which implies that the simple i.i.d. model is suboptimal for these two real-world signals. Moreover, 
AMP-UD2 provides comparable and in most cases higher SDR than SLA-MCMC, which indicates that AMP-UD2 
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Fig. 6. Two AMP-UD implementations, SLA-MCMC, and EM-GM-AMP-MOS reconstruction results for a chirp sound clip as a function of 
measurement rate. {N = 9600, SNR = 5 dB or 10 dB.) 


is more reliable in learning various statistical structures than SLA-MCMC. AMP-UD 1 is the fastest among the four 
algorithms, but it may have lower reconstruction quality than AMP-UD2 and SLA-MCMC, owing to poor selection 
of the subsequences. It is worth mentioning that we have also run simulations on an electrocardiograph (ECG) 
signal, and EM-GM-AMP-MOS achieved similar SDR as the two AMP-UD implementations, which indicates that 
an i.i.d. model might be adequate to represent the coefficients of the ECG signal; the plot is omitted for brevity. 

Runtime: The runtime of AMP-UD 1 and AMP-UD2 for MUnif, MRad, and the speech signal is typically under 
5 minutes and 10 minutes, respectively, but somewhat more for signals such as sparse Laplace and the chirp sound 
clip that require a large number of Gaussian components to be fit. Eor comparison, the runtime of SLA-MCMC 
is typically an hour, whereas typical runtimes of EM-GM-AMP-MOS and turboGAMP are 30 minutes. To further 
accelerate AMP, we could consider parallel computing. That is, after clustering, the Gaussian mixture learning 
algorithm can be implemented simultaneously in different processors. 

VII. Conclusion 

In this paper, we introduced a universal compressed sensing recovery algorithm AMP-UD that applies our 
proposed universal denoiser (UD) within approximate message passing (AMP). AMP-UD is designed to reconstruct 
stationary ergodic signals from noisy linear measurements. The performance of two AMP-UD implementations was 
evaluated via simulations, where it was shown that AMP-UD achieves favorable signal to distortion ratios compared 
to existing universal algorithms, and that its runtime is typically faster. 

AMP-UD combines three existing schemes: (/) AMP [6]; (ii) universal denoising [26]; and (Hi) a density estimation 
approach based on Gaussian mixture (GM) fitting [24]. In addition to the algorithmic framework, we provided three 
specific contributions. Eirst, we provided numerical results showing that SE holds for non-separable Bayesian 
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Fig. 7. Two AMP-UD implementations, SLA-MCMC, and EM-GM-AMP-MOS reconstruction results for a speech signal as a function of 
measurement rate. {N = 10560, SNR = 5 dB or 10 dB.) 

sliding-window denoisers. Second, we modified the GM learning algorithm, and extended it to an i.i.d. denoiser. 
Third, we designed a universal denoiser that does not need to know the bounds of the input or require the input 
signal to be bounded. Two implementations of the universal denoiser were provided, with one being faster and the 
other achieving better reconstruction quality in terms of signal to distortion ratio. 

There are numerous directions for future work. First, our current algorithm was designed to minimize the square 
error, and the denoiser could be modihed to minimize other error metrics [44], Second, AMP-UD was designed to 
reconstruct one-dimensional signals. In order to support applications that process multi-dimensional signals such 
as images, it might be instructive to employ universal image denoisers within AMP. Third, the relation between 
the input length and the optimal window-size, as well as the exponential decay rate of the context weights, can 
be investigated. Finally, we can modify our work to support measurement noise with unknown distributions as an 
extension to adaptive generalized AMP [12]. 


Appendix 

We follow the derivation in Figueiredo and Jain [24]. Denoting 9 = {as, the MML-based criterion 

is 

Q 

>C(q,0) = ^ X! log(Ata,) + ^log(fV)-log(p(q|0)), (17) 

s:as>0 

where n = 2 is the number of parameters per Gaussian component, and Snz is the number of components with 
nonzero mixing probability as- The hrst term is the coding length of {^s,<^s}s=i^ because the expected number 
of data points that are from the s-th component is Nas, hence the effective sample size for estimating {p,s,tTg} 
is Nas- The second term is the coding length of a^’s, because ag’s are estimated from N data points. The third 
term is the coding length of the data sequence q. 
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The complete data expression for log(p(q|^)) is 
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log (p(q, X, z\e)) = log (p{q^, Xi, 4'^V)) 
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Replace log(p(q|0)) in (17) with log (p(q, x, z|0)); 
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Suppose 9{t) = {ai{t), estimate of the 6 at the f-th iteration. 
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where C is a constant that does not depend on 0. 
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E[log(p(q,X,Z|0))|q,0(i)] 
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where C is a constant that does not depend on 6. 
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To estimate {a^}, notice that we have the constraints 0 < ctg < 1, Vs and Es=i cts = 1- Collecting the terms in 
(19) that contain {as}, we have 
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which is the negative log likelihood of a quantity that is proportional to a Dirichlet pdf of (a^, ...,as„^), and its 
mode appears at 
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